Week 1 - R and Stats recap

Marwin Carmo

Simple Linear Regression

  • We’re using the same dataset from last week
icecream <- read.csv("https://shorturl.at/qdZlt")
head(icecream)
  Temperature Sales      Pool
1       57.56   215 1.6101279
2       61.52   325 0.0000997
3       53.42   185 0.1348124
4       59.36   332 3.2882103
5       65.30   406 1.9550195
6       71.78   522 0.9825883

Simple Linear Regression

What is the equation for the regression line?

\[ \hat{Y_i} = b_0 + b_1 X_{1i} \]

We can calculate estimates of these values using their formulas:

\[ \begin{aligned} b_1 &= \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}\\ b_0 &= \bar{Y} - b_1 \bar{X} \end{aligned} \]

b1_estimate <- cov(icecream$Temperature, icecream$Sales)/var(icecream$Temperature)
b0_estimate <- mean(icecream$Sales) - b1_estimate * mean(icecream$Temperature)
b1_estimate
[1] 16.71548
b0_estimate
[1] -694.3695

With these values, we can write out our regression line:

\[ \widehat{Sales}_i = -694.37 + 16.72 \times Temperature_i \]

plot(icecream$Temperature, icecream$Sales, xlab = "Temperature (F)", 
     ylab = "Ice cream Sales")
abline(a = b0_estimate, b = b1_estimate, col = "red") 
# abline() is a function to add a straight line to a plot

In R we can use the lm() function (lm stands for “linear model”).

The function arguments are: lm(dependent_variable ~ independent_variable, data = data_name).

simple_regression <- lm(Sales ~ Temperature, data = icecream)

Let’s take a look at the output using the summary() function:

summary(simple_regression)

Call:
lm(formula = Sales ~ Temperature, data = icecream)

Residuals:
    Min      1Q  Median      3Q     Max 
-75.512 -12.566   4.133  22.236  49.963 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -694.369    105.048   -6.61 6.00e-05 ***
Temperature   16.715      1.592   10.50 1.02e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.13 on 10 degrees of freedom
Multiple R-squared:  0.9168,    Adjusted R-squared:  0.9085 
F-statistic: 110.2 on 1 and 10 DF,  p-value: 1.016e-06

Do those match what we calculated before? How do we interpret these values?
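As a quick check, the hand-computed cov/var estimates always match the coefficients from lm() in simple regression. A sketch using simulated data standing in for the icecream set (so it runs without the URL above):

```r
# Simulated stand-in for the icecream data
set.seed(1)
sim <- data.frame(Temperature = runif(12, 50, 75))
sim$Sales <- -700 + 17 * sim$Temperature + rnorm(12, sd = 40)

# Hand-computed estimates, as above
b1_hat <- cov(sim$Temperature, sim$Sales) / var(sim$Temperature)
b0_hat <- mean(sim$Sales) - b1_hat * mean(sim$Temperature)

# lm() estimates
fit <- lm(Sales ~ Temperature, data = sim)

all.equal(unname(coef(fit)), c(b0_hat, b1_hat))  # TRUE
```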

plot(icecream$Temperature, icecream$Sales, xlab = "Temperature (F)", 
     ylab = "Ice cream Sales", xlim = c(0, 80), ylim = c(-700, 700))
abline(a = b0_estimate, b = b1_estimate, col = "red") 

Hypothesis Testing for the Slope and R-Squared

In addition to the estimates of the intercept and slope, we have a standard error, test statistic, and p-value.

These are for testing the null hypothesis that the parameter is equal to 0 versus the alternative that it’s not equal to 0.

The test for the intercept is typically not very interesting to us. Instead, we're more interested in the test for the slope, because it tells us whether the predictor is useful in predicting the outcome.

The hypotheses for this test are:

  • \(H_0\): \(b_1\) = 0
  • \(H_1\): \(b_1 \neq\) 0

Based on our summary() output, do we reject or fail to reject the null hypothesis? What does that tell us?
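The t value in the summary() table is just the estimate divided by its standard error, and the p-value comes from a t-distribution with \(N - 2 = 10\) df. A sketch, plugging in the slope numbers from the output above:

```r
estimate <- 16.71548   # slope estimate from the summary() table
se       <- 1.592169   # its standard error
n        <- 12         # sample size

t_value <- estimate / se
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)

t_value  # ~10.50
p_value  # ~1.02e-06
```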

Confidence intervals

We can also construct a confidence interval around our estimate of the slope:

\[ \mathrm{CI}_{95\%} = \hat{b_1} \pm t_{crit} \times \mathrm{SE} \]

\[ \mathrm{SE}_{b_1} = \frac{s_\varepsilon}{\sqrt{SS_X}} \]

\[ s_\varepsilon = \sqrt{\frac{1}{N-2}\sum_{i=1}^N (Y_i-\hat{Y_i})^2} \]

sd_error <- sqrt( (1/(nrow(icecream) - 2)) * sum((icecream$Sales - predict(simple_regression))^2) )
sd_error
[1] 38.1265

predict() is a generic function for obtaining predictions from the results of various model-fitting functions.
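For instance, calling predict() with no new data returns the fitted values, while the newdata argument lets us predict at temperatures of our choosing. A sketch using simulated data standing in for the icecream set (the temperatures 60 and 70 are made up for illustration):

```r
# Simulated stand-in for the icecream data
set.seed(1)
sim <- data.frame(Temperature = runif(12, 50, 75))
sim$Sales <- -700 + 17 * sim$Temperature + rnorm(12, sd = 40)
fit <- lm(Sales ~ Temperature, data = sim)

# With no newdata, predict() returns the fitted values for the training data
all.equal(predict(fit), fitted(fit))  # TRUE

# With newdata, it predicts sales at temperatures we choose
predict(fit, newdata = data.frame(Temperature = c(60, 70)))
```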

\[ SS_X = \sum_{i=1}^N (X_i - \bar{X})^2 \]

ss_x <- sum((icecream$Temperature - mean(icecream$Temperature))^2)
ss_x
[1] 573.4233
se_b1 <- sd_error/sqrt(ss_x)
se_b1
[1] 1.592169

The standard error for the slope is given in the output when we call the summary() function on our model. Take a look at it again and try to find it!

summary(simple_regression)$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -694.36947 105.048359 -6.609998 6.002426e-05
Temperature   16.71548   1.592169 10.498555 1.015893e-06

To get the 95% CI we compute \(\hat{b}_1 \pm t_{crit} \times \mathrm{SE}\), where \(t_{crit}\) is the critical value that cuts off the upper 2.5% of a t-distribution with \(N - 2\) df:

t_crit <- qt(.025, nrow(icecream) - 2, lower.tail = FALSE)

lower_limit <- b1_estimate - t_crit * se_b1
upper_limit <- b1_estimate + t_crit * se_b1

lower_limit
[1] 13.1679
upper_limit
[1] 20.26305

We can also obtain the 95% CI using the confint() function:

confint(object = simple_regression, level = 0.95)
                2.5 %     97.5 %
(Intercept) -928.4318 -460.30714
Temperature   13.1679   20.26305

R-Squared

R-squared is the proportion of total variability in our outcome that is explained by our predictor(s).

It is also given in the output of the summary() function:

summary(simple_regression)$r.squared
[1] 0.9168189

What does that mean?
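Concretely, \(R^2\) can be computed by hand as \(1 - SS_{res}/SS_{tot}\), the proportion of total variability not left unexplained. A sketch with simulated data standing in for the icecream set:

```r
# Simulated stand-in for the icecream data
set.seed(1)
sim <- data.frame(Temperature = runif(12, 50, 75))
sim$Sales <- -700 + 17 * sim$Temperature + rnorm(12, sd = 40)
fit <- lm(Sales ~ Temperature, data = sim)

ss_res <- sum(resid(fit)^2)                     # unexplained variability
ss_tot <- sum((sim$Sales - mean(sim$Sales))^2)  # total variability
r2_hand <- 1 - ss_res / ss_tot

all.equal(r2_hand, summary(fit)$r.squared)  # TRUE
```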

You’ll notice that there is also an adjusted \(R^2\)

This is more important in the context of multiple regression because as you add more predictors to the model, your \(R^2\) will typically increase even if the added predictors are just explaining noise.

Therefore, adjusted \(R^2\) corrects for the number of predictors in the model.
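The adjustment uses the formula \(R^2_{adj} = 1 - (1 - R^2)\frac{N-1}{N-p-1}\), where \(p\) is the number of predictors. Plugging in the values from the summary() output above (\(N = 12\), \(p = 1\)) recovers the reported 0.9085:

```r
r2 <- 0.9168189  # multiple R-squared from summary()
n  <- 12         # sample size
p  <- 1          # number of predictors

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.9085
```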

Reporting the results

We ran a simple linear regression to determine whether temperature predicts ice cream sales. The effect of Temperature is statistically significant and positive (b = 16.72, 95% CI [13.17, 20.26], t(10) = 10.50, p < .001). Temperature explained 91.68% of the total variability in ice cream sales.